The purpose of this assignment is to take the normalized expression data that was created in Assignment #1 and then rank genes according to differential expression. With that ranked list perform thresholded over-representation analysis(ORA) to highlight dominant genes/themes in the top set of genes, in Assignmnet 2. Using this information, to try and determine if there is indeed a link between the genes and the disease that they cause. Then see if using non-threshold analysis will give us the same if not similar results to lead us to state that there are genes responsible for the effects of the affected groups.
The next two sections will deal with providing background information from the paper and previous assignments.
The link to the journal can be found at the end of the document, or by clicking the related subheading in the table of contents above.
This was a really interesting paper, but I will save that discussion for the later part of this assignment. A prior transcriptome meta-analysis found significantly decreased levels of corticotropin-releasing hormone (CRH) mRNA in corticolimbic brain areas in MDD patients, indicating that cortical CRH-expressing (CRH+) cells are impaired in MDD. Although rodent studies reveal that cortical CRH is predominantly expressed in GABAergic interneurons, little is known about the characteristics of CRH+ cells in the human cerebral cortex and their relationship to MDD. Human volunteers without brain illnesses had their subgenual anterior cingulate cortex (sgACC) identified for CRH and markers of excitatory (SLC17A7), inhibitory (GAD1), and other interneuron subpopulations using fluorescent in situ hybridization (FISH) (PVALB, SST, VIP). Changes in CRH+ cell density and cellular CRH expression (n = 6/group) were investigated in MDD patients. RNA-sequencing was done on sgACC CRH+ interneurons from comparison and MDD participants (n = 6/group) to see if there were any variations between the two groups. In mice with TrkB function suppressed, the effect of decreased BDNF on CRH expression was investigated. GABAergic cells made up 80 percent of CRH+ cells, whereas glutamatergic cells made up 17.5 percent. VIP (52%) and SST (7%), as well as PVALB, were co-expressed by CRH+ GABAergic interneurons (7 percent ). MDD patients had lower CRH mRNA levels in GABAergic interneurons than control participants, despite no differences in cell density. The transcriptome profile of CRH+ interneurons suggests decreased excitability and less GABA release and reuptake. Further research revealed that these molecular alterations are not caused by altered glucocorticoid feedback, but rather occur downstream of a common neurotrophic function modulator.
Essentially, there was a strong relationship between the gene expression or lack thereof for individuals who suffered from MDD.
Here is a direct link to the query for this dataset. (GSE193417).
Here is a direct link to the paper that is associated with the dataset above. (PMID: 35280164) (PMCID: PMC8913899)
In Assignment 1, we created log density graphs, MDS plots, MA plots after we chose a expression dataset from GEO database so that we may use them for analysis and processing as one would do in the field of bioinformatics. After selecting this expression set, we are to retrieve it, map it, normalize it, and finally interpret it, by use of graphs and plots as mentioned earlier. The conclusion that I came to in Assignment 1 was that there was in fact a strong relationship between the genes, their expressions and the results the authors of the paper had.
The dataset (GSE193417), started with 19961 genes for each of the 12 Samples; CRH-Hu1001 sgACC_MDD, CRH-Hu103 sgACC_control, CRH-Hu1047 sgACC_control, CRH-Hu1086 sgACC_control, CRH-Hu513 sgACC_MDD, CRH-Hu600 sgACC_MDD, CRH-Hu615 sgACC_control, CRH-Hu789 sgACC_control, CRH-Hu809 sgACC_MDD, CRH-Hu852 sgACC_control. Where sgACC_control groups are individuals have unimpaired CRH+ cells whereas MDD individuals have impaired CRH+ cells. After cleaning and normalizing the data, by removing genecounts that were fewer than 6 due to each sample size having 6 particpants each, the dataset had the ensemble_gene_ids mapped to HGNC symbols for easy gene identification afterwhich the genecounts were normalized and plotted. Assignment 1 removed 23.11% of the original genes, which left me with a final gene count of 15349. Which is slightly lower than the paper’s genecount but that was most likely due to the authors using different cleaning and limitation methods. Which was my final data frame object , ‘FinalGeneFilter’. What I found in the normalized dataset after plotting was that there was a causal relationship, the same conclusions as the author. For more information about this please see the Figure 1, an MDS below and Figure 2, the density distribution curve, both using the normalized dataset as mentioned before.
In assignment 2, the ORA fully validate the authors’ conclusions in the original study (Oh, Hyunjung et al., 2022), because they ran a differential expression analysis of the genes themselves and obtained comparable results. They had 835 genes with differential expression, but I only had 608. ClueGo, a Cytoscape plugin, is mentioned in the study. The study’s authors discovered 307 genes, 168 upregulated, 139 downregulated, the control and with significant group difference of 528 gene sets, 267 upregulated, 261 downregulated, for those with MDD. To discover changed biochemical pathways in these interneurons, the scientists employed Gene Set Enrichment Analysis (GSEA) with complete transcriptome data. I want to emphasise that they didn’t eliminated outliers, or that they did it in a different way, genes whose gene symbols were detected by the GeneMANIA app were included in the study. As a consequence, eight genes were eliminated (AC005747.1, AC011448.1, CNMD, ECPAS, H4C4, LRATD2, SELENOK, SLC35E2A). but I can’t locate that in the study because it isn’t discussed. The 307 gene pathways that were identified and isolated, along with a number of genes, which can be referred to below.
So, to summarize and reiterate the results, there is a result and correlation between the genes, and their effects.
Set up all the data we used from Assignment 2, and by extension 1.
#check to see if the files are actually there. If not delete everything and try again.
if (!file.exists("GSE193417_normalized_datastruct.rds")) {
print("The required files do not exist, one moment while all the data is being configured.
If the notebook stops running, you might need to delete all the
associated files with this notebook and run it from the begining.")
} else {
print("The required files exist, you may continue to run this R-Notebook")
}
## [1] "The required files exist, you may continue to run this R-Notebook"
Conduct non-thresholded gene set enrichment analysis using the ranked set of genes from Assignment #2.
I will be using GSEA 4.1.0 from Docker image risserlin/em_base_image:version_with_bioc_3_13 per lectures and a prior homework. I will be using the current April release of a geneset from the Gary Bader lab for enrichment analysis, containing GO biological processes and pathways but no IEA, Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt
All results have been uploaded to GitHub as the main folder. with the main summary here.
This geneset uses HGNC symbols, so construct the ranked list for GSEA accordingly.
#Download the database
#Check to see or install the required files.
tryCatch(expr = { library("RCurl")},
error = function(e) { install.packages("RCurl")},
finally = library("RCurl"))
tryCatch(expr = { library("BiocManager")},
error = function(e) {
install.packages("BiocManager")},
finally = library("BiocManager"))
tryCatch(expr = { library("ggplot2")},
error = function(e) { install.packages("ggplot2")},
finally = library("ggplot2"))
#Code from Enrichment Map Protocol (Isserlin 2022)
#This is to search the repository at the website, then find the applicable geneset based on regex specifications.
if( !file.exists("Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt"))
{
gmt_url = "http://download.baderlab.org/EM_Genesets/March_01_2021/Human/symbol/"
filenames = getURL(gmt_url)
tc = BiocGenerics::textConnection(filenames)
contents = readLines(tc)
close(tc)
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)",
contents,
perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- paste0("./", gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)
pathways <- fgsea::gmtPathways("Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt")
} else {
pathways <- fgsea::gmtPathways("Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt")
}
#See this link https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Ranked_Gene_Lists
#Manual rank calculation to get an idea of the order of the genes,
rnk <- output_hits[output_hits$gene.hgnc_symbol != "" & !is.na(output_hits$gene.hgnc_symbol),]
#calculate the ranks for each of the different genes
rnk$rank <- -log10(rnk$P.Value) * sign(rnk$logFC)
#can also do it with the t value like so:
#alternaternk <- -log10(rnk$P.Value) * sign(rnk$t)
output_hits_test <- rnk[order(rnk$rank, decreasing = TRUE),]
ranked_genes <- data.frame(GeneName = (output_hits_test$gene.hgnc_symbol), rank = output_hits_test[,"rank"])
smallval = length(rownames(ranked_genes))-5
biggestval = length(rownames(ranked_genes))
#top 5 upregullated genes, lowestes 5 downregulated genes.
knitr::kable(ranked_genes[c(1:5, smallval:biggestval), ], type="pipe")
| GeneName | rank | |
|---|---|---|
| 1 | HADHB | 2.865543 |
| 2 | SCAPER | 2.747490 |
| 3 | PUM1 | 2.646464 |
| 4 | CEP41 | 2.604923 |
| 5 | PCGF5 | 2.533710 |
| 15336 | GAK | -2.920561 |
| 15337 | WIF1 | -2.927842 |
| 15338 | TAF1 | -3.090734 |
| 15339 | METRN | -3.150393 |
| 15340 | CCDC93 | -3.486619 |
| 15341 | GAS6 | -4.196226 |
#Save the ranked genes file to be used with GSEA, and Cytoscape later on.
#This gene list is in accordance with the format of GeneName and Rank.
write_tsv(ranked_genes, "GSEA_GSE193417_ranked.rnk")
Table 1. Shows the top 5 genes the and bottom 5 genes based on their gene rank according rank calculation of -log10(gene’s P.Value) * gene’s fold change value.
#Build intuition of the ordering.
fgseaRes <- fgsea(pathways = pathways,
stats = deframe(ranked_genes),
eps = 0.0,
minSize = 15,
maxSize = 200)
#top 6 pathways based on Normalised Enrichment Score i.e. the 6 most upregulated pathways
head(fgseaRes[order(NES, decreasing = TRUE), ])
#bottom six 6 pathways based on Normalised Enrichment Score i.e. the 6 most downregualted pathways
tail(fgseaRes[order(NES, decreasing = TRUE), ])
fgseaRes
Look at the top gene pathway for the head and the last gene which is the most downregulated. Compare them to the paper.
After manually creating an .rnk, I used the GSEA software and used the baderlab geneset collection from March 1, 2021 containing GO biological process, no IEA and pathways Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt as our geneset database.
I ran GSEAPreranked with the following parameters (GSEA v4.2.3):
Here is the full parameter and settings report Here is the full GSEA application output aw messages/log
read_lines("./A3_GSEA.1650051713721/GSEAlogOutpt.txt", )[0:20]
## [1] "< Process output will appear below >"
## [2] ""
## [3] "Done initing things while splashing"
## [4] "[1650051376718] [INFO] Loading ... C:\\Users\\hossa\\Downloads\\Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt"
## [5] "[1650051376734] [INFO] Begun importing: GeneSetMatrix from: Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt"
## [6] "[1650051377530] [INFO] Loaded file: C:\\Users\\hossa\\Downloads\\Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt"
## [7] "[1650051377532] [INFO] <html>Loading ... 1 files<br><br>Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt"
## [8] "<br>Files loaded successfully: 1 / 1<br>There were NO errors</html>"
## [9] "[1650051392141] [INFO] Begun importing: RankedList from: C:\\Users\\hossa\\Downloads\\GSEA_GSE193417_ranked.rnk"
## [10] "[1650051392194] [INFO] Loaded file: C:\\Users\\hossa\\Downloads\\GSEA_GSE193417_ranked.rnk"
## [11] "[1650051392197] [INFO] <html>Loading ... 1 files<br><br>GSEA_GSE193417_ranked.rnk"
## [12] "<br>Files loaded successfully: 1 / 1<br>There were NO errors</html>"
## [13] "[1650051404823] [INFO] <html><body><b>This will remove these files from this list (but NOT delete the files themselves)</b></body></html>"
## [14] ">> {rpt_label=GSEA_GSE193417_prerankd, rnd_seed=timestamp, set_min=15, chip=ftp.broadinstitute.org://pub/gsea/annotations_versioned/Human_HGNC_ID_MSigDB.v7.5.1.chip, zip_report=false, create_svgs=false, scoring_scheme=weighted, rnk=C:\\Users\\hossa\\Downloads\\GSEA_GSE193417_ranked.rnk, norm=meandiv, out=C:\\Users\\hossa\\gsea_home\\output\\apr15, mode=Abs_max_of_probes, include_only_symbols=true, set_max=200, gmx=C:\\Users\\hossa\\Downloads\\Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt, make_sets=true, plot_top_x=20, gui=false, nperm=1000, collapse=No_Collapse}"
## [15] "[1650051713740] [INFO] No ranked list collapsing was done .. using original as is"
## [16] "to parse>C:\\Users\\hossa\\Downloads\\Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt< got: [C:\\Users\\hossa\\Downloads\\Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt]"
## [17] "[1650051713774] [INFO] Timestamp used as the random seed: 1650051713721"
## [18] "[1650051713781] [INFO] Got gsets: 18727 now preprocessing them ... min: 15 max: 200"
## [19] "Done removeGeneSetsSmallerThan: 15 for: 501 / 6973"
## [20] "Done removeGeneSetsSmallerThan: 15 for: 1001 / 6973"
df <- read_tsv(‘C:/Users/bob/Downloads/data.tsv’, col_names=FALSE)
The raw data can be found here for na_pos Enrichment in phenotype: na_pos
158 gene sets are significantly upregulated at nominal pvalue < 5%. Top 5 results:
knitr::include_url("https://rawcdn.githack.com/bcb420-2022/Sabbir_Hossain/c5c7a7812d6eccaa9b2a352da118445c4b2b5ad2/A3/gsea_report_for_na_pos_1650051713721.html", height = "500px")
Table 2a. Shows the html preview of the na_pos for the gene pathways that were determined by GESA analysis.
The raw data can be found here for na_neg Enrichment in phenotype: na_neg
213 gene sets are significantly downregulated at nominal pvalue < 5%. Top 5 results:
knitr::include_url("https://rawcdn.githack.com/bcb420-2022/Sabbir_Hossain/c5c7a7812d6eccaa9b2a352da118445c4b2b5ad2/A3/gsea_report_for_na_neg_1650051713721.html", height = "500px")
Table 2b. Shows the html preview of the na_neg for the gene pathways that were determined by GESA analysis.
Many top terms for na_pos phenotype (up-regulated genes in sgACC cells of the human brains) are related to CRH+ expressing interneuron cells, which is not seen in thresholded analysis results for up-regulated genes. Whereas in thresholded analysis results, the top terms are related to cell development. Since the paper mentions the terms like; “negative regulation of CRH+ cells” and “SLC17A& excitatory markers” and, “inhibitory (GAD1) neurons”, as well as “markers of other interneuron subpopulations (PVALB, SST, VIP)” we can look for cells related to those keywords and functions. About 80% of CRH+ cells were GABAergic whereas 17.5% were glutamatergic. CRH+ GABAergic interneurons co-expressed VIP (52%), SST (7%), or PVALB (7%). Note that MDD subjects displayed lower CRH mRNA levels in GABAergic interneurons relative to comparison subjects without changes in cell density. We are looking for anything related to these subpopulations.
In the thresholded analysis results, I had GAD6 as the most negatively differentially expressed gene ans smallest Pvalue, while C2CD3 was the gene with the largest pvalue and MT-CO1 was also a gene of interest. While both terms appears in the GSEA results, there appear to be better pathways for some of the cases. This is most notable in the na_pos regulated section of the pathways, as the ones that are included in this are slightly different. The na_neg results do match up entirely. By looking at the two links above, the positive and negative enrichment respectively, we can
OLFACTORY SIGNALING PATHWAY%REACTOME%R-HSA-381753.2
SIGNAL RELEASE FROM SYNAPSE%GOBP%GO:0099643
Refer to the embedded tables above to see this.
At the same p-value of 0.05, GSEA discovered fewer substantially enriched gene sets than differentially-expressed genes found by thresholded analysis, however we should note that we are comparing “apples” to “oranges” perse, we are comparing entire genes to other genes in thresholded analysis, not so much their pathways. We do understand the types of genes that are being expressed and regulated in whatever manner, but in this regard we are not preliminary limiting the genes we are interested in, but instead waiting for the software to do it. because we are comparing gene sets to genes.
Using GSEA, similar upregulated and downregulated pathways were discovered. Cell signalling, inhibitory, regulatory, maintenance and cellular response are also linked in the top elevated processes/pathways. We can see this from a few of the same data sources, Reactome, GO:BP, and Wikipathways, show up and are shared from the A2, for example GO:BP has a lot of hits, which is indeed the same as what we had for A2, as during the Manhattan plots, we could see a lot of the gene expressions were plotted in that area. We can see that there is a lot of the same kinds of pathways being referenced, for example, there is a lot of mention of cellular; release, signalling, maintenance like “olfactory signaling pathway”, “polysaccharide catabolic process”, “phenol-containing compound metabolic process” for na_pos and signal “release from synapse”, “neurotransmitter secretion”. What this means essentially, is that there is a strong relationship both between those individuals who suffer from MDD, because their regulation and signalling pathways are working incorrectly and that individuals who do not suffer from MDD, with all their cellular functions are working, and also being highly expressed/used.
The qualitative comparison isn’t as obvious because one on hand I’m only comparing the top results, for each respective section, and while the sections and genes, and pathways in this section match up, it might not be the case for the “meat and potatoes” portions of each respective genelist. Furthermore, the data sources used are varied. MSigDB is one of the sources used by GSEA. There was also a query option in the g:Profiler result that returned the entire differentially expressed gene list, which is not available in GSEA. There were a lot of options, and areas where the data could be different, which is why in one such example, for the na_pos pathways we have a slight remix of the order of top hits, the na_neg still remain the same. This suggests that the methods used to find the trend, fix the data, clean it, all the steps we took thus far, etc. could have slightly varied the data, and order. Not enough to completely alter the OVERALL result, but enough to make me stop and say “wow, that’s interesting”. Given that for the ranked list we used to run GESA, wasn’t threshold and we let the software do the calculations and interpretations of the list we curated, I would reason that it is more accurate, the GESA outputs than my own analysis in A2. Not to say that I am entirely wrong, there is a link between CRH+ cells and whether or not they are GABAergic(good) or glutamatergic(not so good). There is a comparison to be made however, that there is indeed some kind of link between the pathways affected, and their effects on people. Genetic expression affects genetic functions, which affects genetic pathways. It is wrong to compare them as an “apples” to “apples” analysis as I said above, they are two different things, accomplishing two seperate things, but the results they give us is what supports further research, because there is a link, just as the authors (Oh, Hyunjung et al., 2022) had stated.
So in short it isn’t as straightforward and looking at it at face value to make a conclusion, you have to look closer and understand the details and what each is presenting, one is specifically about genes, DE ORAs, like what we did in A2 whilst what we are doing here is looking specifically at pathways and their representations, as well as interactions. I could argue that this is a better form of analysis as well, we see a bigger picture, it is more than the individual gene, which is essentially what cellular interations are all about.
MA plot of all the values from A2, which shows the hits and genes from the different annotation databases. This includes both downregulated and upregulated genes for this dataset. The large numbers of hits from HPA, GO:BP, GO:CC. GO:MF make sense because there are a number of genes that account for different processes, i.e. molecular function genes, cellular components, human proteins etc. and the fact that we have returns from these annotations databases make sense. We can verify this by just looking at some of the hits from our volcano plot earlier and see that many of the genes that are regulated do indeed fall under the functions and databases they have pinged from. Here is a list of them for reference; MT-CO1 GAS6, CCDC93, METRN, TAF1, GAK, WIF1.
Using your results from your non-thresholded gene set enrichment analysis visualize your results in Cytoscape.
Figure 2a. Number of nodes and edges for the generated network map.
Figure 2b. Network Legend for all the network maps.
Simple figure of all the clusters in the network map.
The default setting for the autoannoate settings for the network map.
This is laid out so it easier to read, and interpret. We can see that there is some degree of interactions and closeness based not only on the coloured interactions but also with the groupings that are arising.
This was just a figure I wanted to include to highlight the process of how important it is to actually be able to read the data that you are analysizing. I cannot see the interactions or ideas when the data is all smushed together. I think this is a much better way of going through the information.
Completely able to visualize and see the interactions based on the edges, nodes and neighbours.
This figure is just to demonstrate all the different interations and ideas that we get between the different gene pathways.
This figure is just to demonstrate all the different gene interations and ideas that we get between this dataset.
This network map is ordered with the top left having the largest size pathway and the other pathways related to it. We can see signalling pathways, and regulation pathways are some with the largest hits.